Add a HybridReader for use in write constrained databases#423
Open
jimhester wants to merge 5 commits intoposit-dev:mainfrom
Open
Add a HybridReader for use in write constrained databases#423jimhester wants to merge 5 commits intoposit-dev:mainfrom
jimhester wants to merge 5 commits intoposit-dev:mainfrom
Conversation
Wraps any Reader (the data side) with an in-process DuckDBReader (the staging side). register() writes to staging; execute_sql routes whole queries to staging or the primary based on whether they reference any registered name. Behind the existing 'duckdb' feature. Tests cover the routing scanner (identifier-boundary checks, qualified references, double-quoted identifiers, string-literal exclusion), register/unregister name tracking, dialect dispatch, and the documented cross-side limitation.
Per code review: the original tests for routing direction and dialect selection used identical setup on both sides, so they passed regardless of impl correctness. The dialect test now uses a SqliteReader on the data side (SQLite dialect) so the staging-vs-data distinction surfaces in sql_greatest output, and the cross-side test now registers staged_only in both data and staging with different values so a wrong-route would succeed silently rather than erroring for the same reason as the correct route. Also corrects an inverted "false-negative" label and softens the misleading "comments are harmless" note in the references_staged_name doc-comment.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
HybridReader, aReaderthat composes any primary reader (the"data" side) with an in-process
DuckDBReader(the "staging" side).register()writes go to staging;execute_sqlroutes queries thatmention any registered name to staging, and everything else to the
primary.
Behind the existing
duckdbfeature flag — no new feature, no newdependencies.
Companion design comment with the broader sequencing context:
#341 (comment). Related to but implementation distinct from #422
Motivation
Some data sources are read-only by nature (Flight SQL servers, anonymous
Trino) or expensive to write to repeatedly during visualization
iteration (Snowflake).
HybridReadercomposes a primary reader (theremote data source) with a local DuckDB instance (staging).
register()writes to staging — sidestepping read-only or auth restrictions — while
execute_sqlroutes queries to the right side based on which tablesthey reference. Same
Readerinterface; no caller-visible difference.The design also pairs naturally with a query-result cache (PR3) that
memoizes remote query results in the staging DuckDB. The cache isn't in
this PR, but the staging plumbing it relies on is.
Design
HybridReaderowns:data: Box<dyn Reader + Send>— the primary backend.staging: DuckDBReader— an in-process DuckDB instance.staged_names: RefCell<HashSet<String>>— the namesregister()hasput into staging.
The routing predicate
references_staged_nameis a lightweight SQLscanner — not a full parser. It checks whether any registered name
appears as a SQL identifier (with identifier-boundary respect, qualified
references like
catalog.schema.name, double-quoted identifiers, andsingle-quoted-string-literal exclusion). Comments are not currently
parsed: a stray identifier inside a
-- commentcould route aprimary-data query to staging, where it would fail with a clear error
rather than succeeding against the primary backend.
Reader::dialect()returns the staging DuckDB dialect, because allinternally-generated SQL (stat transforms, layer filters, temp-table
DDL) targets staging. Callers that need the primary's dialect (e.g.
schema introspection of the remote catalog) get it via the inherent
HybridReader::data_dialect()method.Limitations (documented)
A single SQL statement cannot reference both staged names and
primary-data tables. Queries are dispatched whole; cross-backend joins
are unsupported. Materialize one side into staging first if you need to
combine them. There is a regression test pinning this behavior.
Staged data lives in the in-process DuckDB instance and is released
when the
HybridReaderis dropped — no spill-to-disk, no shared cache.Testing
All tests are offline, no external setup:
single match, rejection of longer-identifier overlap (
ordersshouldnot match
orders_detail), rejection of identifier-prefix overlap(
colshould not matchcol_id), rejection of single-quoted-stringcontents, match of double-quoted identifiers, match of qualified
references (
catalog.schema.orders), and SQL-standard''escapeinside a string literal.
registerdelegates to staging andtracks the name;
execute_sqlroutes a registered name to staging;execute_sqlroutes an unregistered name to data;unregisterdelegates and untracks;
dialect()returns the staging dialect witha discriminating SQLite-on-the-data-side setup.
and primary-only names routes wholly to staging and surfaces a
staging-side error rather than silently joining. The setup
discriminates correct routing from a wrong-route that would
otherwise succeed.
The dialect-discrimination test uses a
SqliteReaderfor the dataside (Ansi
CASE-formsql_greatest) against a DuckDB staging(
GREATEST(a, b)), so a regression that returned the data dialectinstead of staging's would fail visibly. Gated on the
sqlitefeature,which is in upstream's default feature set.
What's next
Per the design comment, a follow-up PR adds:
(
hybrid_cache.rs), aReader::clear_cache()trait default,Vega-Lite v5+v6 mime emission in the Jupyter kernel, and the
-- @uncacheJupyter meta-command.The cache makes the iterate-on-remote case sub-millisecond on cache
hits while keeping the same
Readerinterface; it's gated by an envvar and fronted by a public
CacheConfigfor callers that want totune TTL or the byte budget.